tg.observer

A dataset of public Telegram channels

Aleksi Knuutila

University of Helsinki

Roman Kyrychenko

University of Helsinki

Vasileios Maltezos

University of Helsinki

COMPTEXT 2024

Today

  1. Why study Telegram?
  2. Existing tools and gaps
  3. tg.observer
  4. Usage examples

Slides

https://slides.knuutila.net/comptext2024

Background

  • “Eyewitness images from Ukraine”: 2023-2025
  • “Reimagining image verification”: 2025-2026

Why study Telegram?

  • Lenient content policy -> attracts pariah communities
  • Primary source of information in many conflicts
  • Used for mobilisation, e.g. opposing COVID-19 policies

-> Important to lower the cost of investigating Telegram

Data collection from Telegram

Nice API!

But poor global search :(

Typical research process

  1. Identify channels related to your search interest
  2. Follow references between channels to find a larger set (snowball sampling)

-> Still requires using API, technical competency

Existing directories

  • Russian service TGStat often starting point for research (over 600 citations)

Problems:

  • Few channels from Europe
  • Opaque about methodology

tg.observer

tg.observer aims

  • Publishing channel metadata and citation graphs
  • Transparent enough to allow scholarly use
  • Complementarity with existing tools
  • Coverage of smaller user communities

Data available at https://tg.observer, currently with 350,000 accounts

Data collection loop

  • Initial list of channels from TGDataset
  • Build citation graph based on shared posts, mentions, and URLs referencing other channels
  • Identify next batch of channels
  • Retrieve all posts from these channels
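
The loop above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tg.observer crawler: `extract_references`, `crawl`, `fetch_posts` and the fake posts are hypothetical names and data, the regular expressions are only a rough approximation of Telegram usernames, and forwarded posts (which the API exposes as message metadata rather than text) are not handled. The "most-cited first" ordering is a placeholder for whatever prioritisation the crawler actually uses.

```python
import re
from collections import Counter

# Rough patterns for illustration; they will miss or over-match some cases
# (for example, invite links such as t.me/joinchat/... are not filtered out).
LINK = re.compile(r"t\.me/(?:s/)?([A-Za-z][A-Za-z0-9_]{4,31})")
MENTION = re.compile(r"(?<![\w/.])@([A-Za-z][A-Za-z0-9_]{4,31})")

def extract_references(text):
    """Return the lower-cased channel usernames referenced in a post's text."""
    return {name.lower() for name in LINK.findall(text) + MENTION.findall(text)}

def crawl(seeds, fetch_posts, rounds=3, batch_size=1):
    """Alternate between retrieving posts and choosing the next batch."""
    crawled, edges = set(), set()
    batch = sorted(seeds)
    for _ in range(rounds):
        if not batch:
            break
        for channel in batch:
            for text in fetch_posts(channel):
                for target in extract_references(text):
                    if target != channel:
                        edges.add((channel, target))
            crawled.add(channel)
        # Placeholder prioritisation: uncrawled channels, most-cited first
        counts = Counter(target for _, target in edges if target not in crawled)
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        batch = [channel for channel, _ in ranked[:batch_size]]
    return crawled, edges

# Made-up posts standing in for API responses
FAKE_POSTS = {
    "seed_one": ["Read https://t.me/channel_b", "via @channel_c"],
    "channel_b": ["forwarded from t.me/channel_c"],
    "channel_c": ["hello @seed_one, mail me@example.com"],
}
crawled, edges = crawl({"seed_one"}, lambda name: FAKE_POSTS.get(name, []))
```

In a real crawl, `fetch_posts` would wrap calls to the Telegram API and respect its rate limits.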

Sampling techniques

  • Sampling from hidden network, constrained by rate limits
    -> How to prioritise?

  • Typically breadth-first search or exponential discriminative snowball sampling
    -> Problem: Bias towards high-degree nodes, smaller communities not represented

  • tg.observer crawls based on Mutual Friend Crawling (Blenn et al. 2017) within all communities and languages
    \[ S_R = \frac{\text{degree of node f within community}}{\text{degree of node f in entire graph}} \]
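
A minimal sketch of how the score above could rank candidate channels, assuming the "community" is the set of channels already crawled and that each candidate's full neighbour set is known. Both are assumptions made for illustration; the slides do not state how tg.observer defines communities or estimates degrees during the crawl. The toy graph, the channel names and the helper functions are made up.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (channel, channel) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def reference_score(adjacency, node, community):
    """S_R: share of a node's links that lead into the community."""
    neighbours = adjacency[node]
    if not neighbours:
        return 0.0
    return len(neighbours & community) / len(neighbours)

toy_edges = [
    ("a", "b"), ("a", "c"), ("b", "c"),            # crawled community
    ("b", "small_channel"), ("c", "small_channel"),  # all links point inward
    ("c", "hub_channel"),                           # one link inward ...
    ("hub_channel", "x1"), ("hub_channel", "x2"), ("hub_channel", "x3"),  # ... three outward
]
adjacency = build_adjacency(toy_edges)
crawled = {"a", "b", "c"}

# Candidates: channels linked to the crawled set but not yet crawled
frontier = set().union(*(adjacency[c] for c in crawled)) - crawled
scores = {node: reference_score(adjacency, node, crawled) for node in sorted(frontier)}
next_channel = max(scores, key=scores.get)
```

Here the small channel scores 2/2 and the hub scores 1/4, so the small channel is crawled first even though the hub has the higher degree. This is the opposite of the ordering a degree-driven breadth-first crawl would tend to produce.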

Usage examples

Data exploration

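import pandas as pd
import graphistry

# Dutch-language channels, largest first
nl_channels = pd.read_csv('https://tg.observer/tg_observer/telegram_channels.csv?language__exact=nl&_sort_desc=participants_count&_size=max')
# Citation edge list between channels
edgelist = pd.read_parquet('https://tg.observer/static/edgelist_02052024.parquet')

# Keep edges where the source or the target is a Dutch-language channel
endpoints = edgelist[['source_channel', 'target_channel']]
edgelist = edgelist[endpoints.isin(nl_channels['username'].values).any(axis=1)]

# Plot on Graphistry Hub (register() also needs your own account credentials)
graphistry.register(server="hub.graphistry.com")
graphistry.bind(source='source_channel', destination='target_channel').plot(edgelist)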

Sampling channels

import networkx as nx
import pandas as pd

edgelist = pd.read_parquet('https://tg.observer/static/edgelist_02052024.parquet')
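# Undirected graph by default: citations are followed in both directions
G = nx.from_pandas_edgelist(edgelist, source='source_channel', target='target_channel')

initial_node = "flat_chat_nederland"
depth_limit = 2

# Breadth-first snowball sample: channels within depth_limit hops of the seed
snowball_sample = nx.bfs_tree(G, source=initial_node, depth_limit=depth_limit)
list(snowball_sample)[0:10]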
['flat_chat_nederland',
 'Cosmomentiras',
 'Frequency',
 'GesaraNederland',
 'IBIZAMagicIsland2Q21QWWG1WGA',
 'OmlinNews',
 'POntario',
 'TARTARIAITALIACHANNEL',
 'TheGoodRebel17',
 'TheTruthSeekersChannel']

Next steps

  • Publication explaining dataset
  • Continuous updates
  • Hear from early adopters
  • Applying it in our own research: Mapping global Telegram, evaluating established sampling techniques

Thanks for listening!

  • https://tg.observer

  • https://slides.knuutila.net/comptext2024

  • aleksi.knuutila@helsinki.fi

  • https://knuutila.net

  • @knuutila